Chronic Kidney Disease Analysis¶

In this EDA project, we'll dive into the world of chronic kidney disease (CKD) using a dataset from Kaggle. Analyzing health data can seem a bit intimidating, but it's a critical task that helps us understand complex medical conditions.

For doctors and healthcare professionals, diagnosing a chronic disease isn't always straightforward.

It requires a careful look at many different factors, from a patient's lab results to their overall health indicators.

This process is made even more challenging when trying to identify patterns that predict the onset of a disease.

This is where we, as data analysts, can help! Through this Exploratory Data Analysis (EDA) project, we will clean, analyze, and visualize the chronic kidney disease dataset.

Our goal is to uncover valuable insights and correlations that could assist in identifying the key factors associated with the disease. By exploring this data, we hope to make the task of early detection and risk assessment a little less daunting.

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')
In [2]:
df = pd.read_csv('Kidney_disease.csv')
df
Out[2]:
id age bp sg al su rbc pc pcc ba ... pcv wc rc htn dm cad appet pe ane classification
0 0 48.0 80.0 1.020 1.0 0.0 NaN normal notpresent notpresent ... 44 7800 5.2 yes yes no good no no ckd
1 1 7.0 50.0 1.020 4.0 0.0 NaN normal notpresent notpresent ... 38 6000 NaN no no no good no no ckd
2 2 62.0 80.0 1.010 2.0 3.0 normal normal notpresent notpresent ... 31 7500 NaN no yes no poor no yes ckd
3 3 48.0 70.0 1.005 4.0 0.0 normal abnormal present notpresent ... 32 6700 3.9 yes no no poor yes yes ckd
4 4 51.0 80.0 1.010 2.0 0.0 normal normal notpresent notpresent ... 35 7300 4.6 no no no good no no ckd
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
395 395 55.0 80.0 1.020 0.0 0.0 normal normal notpresent notpresent ... 47 6700 4.9 no no no good no no notckd
396 396 42.0 70.0 1.025 0.0 0.0 normal normal notpresent notpresent ... 54 7800 6.2 no no no good no no notckd
397 397 12.0 80.0 1.020 0.0 0.0 normal normal notpresent notpresent ... 49 6600 5.4 no no no good no no notckd
398 398 17.0 60.0 1.025 0.0 0.0 normal normal notpresent notpresent ... 51 7200 5.9 no no no good no no notckd
399 399 58.0 80.0 1.025 0.0 0.0 normal normal notpresent notpresent ... 53 6800 6.1 no no no good no no notckd

400 rows × 26 columns

In [3]:
df.head()
Out[3]:
id age bp sg al su rbc pc pcc ba ... pcv wc rc htn dm cad appet pe ane classification
0 0 48.0 80.0 1.020 1.0 0.0 NaN normal notpresent notpresent ... 44 7800 5.2 yes yes no good no no ckd
1 1 7.0 50.0 1.020 4.0 0.0 NaN normal notpresent notpresent ... 38 6000 NaN no no no good no no ckd
2 2 62.0 80.0 1.010 2.0 3.0 normal normal notpresent notpresent ... 31 7500 NaN no yes no poor no yes ckd
3 3 48.0 70.0 1.005 4.0 0.0 normal abnormal present notpresent ... 32 6700 3.9 yes no no poor yes yes ckd
4 4 51.0 80.0 1.010 2.0 0.0 normal normal notpresent notpresent ... 35 7300 4.6 no no no good no no ckd

5 rows × 26 columns

In [4]:
df.tail()
Out[4]:
id age bp sg al su rbc pc pcc ba ... pcv wc rc htn dm cad appet pe ane classification
395 395 55.0 80.0 1.020 0.0 0.0 normal normal notpresent notpresent ... 47 6700 4.9 no no no good no no notckd
396 396 42.0 70.0 1.025 0.0 0.0 normal normal notpresent notpresent ... 54 7800 6.2 no no no good no no notckd
397 397 12.0 80.0 1.020 0.0 0.0 normal normal notpresent notpresent ... 49 6600 5.4 no no no good no no notckd
398 398 17.0 60.0 1.025 0.0 0.0 normal normal notpresent notpresent ... 51 7200 5.9 no no no good no no notckd
399 399 58.0 80.0 1.025 0.0 0.0 normal normal notpresent notpresent ... 53 6800 6.1 no no no good no no notckd

5 rows × 26 columns

In [5]:
df.sample(5)
Out[5]:
id age bp sg al su rbc pc pcc ba ... pcv wc rc htn dm cad appet pe ane classification
272 272 56.0 80.0 1.025 0.0 0.0 normal normal notpresent notpresent ... 42 5600 5.5 no no no good no no notckd
263 263 45.0 80.0 1.020 0.0 0.0 normal normal notpresent notpresent ... 45 8600 5.2 no no no good no no notckd
371 371 28.0 60.0 1.025 0.0 0.0 normal normal notpresent notpresent ... 51 6500 5.0 no no no good no no notckd
226 226 64.0 100.0 1.015 4.0 2.0 abnormal abnormal notpresent present ... 26 7500 3.4 yes yes no good yes no ckd
220 220 36.0 80.0 1.010 0.0 0.0 NaN normal notpresent notpresent ... 36 8800 NaN no no no good no no ckd

5 rows × 26 columns

In [6]:
df.shape
Out[6]:
(400, 26)
In [7]:
df.dtypes
Out[7]:
id                  int64
age               float64
bp                float64
sg                float64
al                float64
su                float64
rbc                object
pc                 object
pcc                object
ba                 object
bgr               float64
bu                float64
sc                float64
sod               float64
pot               float64
hemo              float64
pcv                object
wc                 object
rc                 object
htn                object
dm                 object
cad                object
appet              object
pe                 object
ane                object
classification     object
dtype: object
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 26 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              400 non-null    int64  
 1   age             391 non-null    float64
 2   bp              388 non-null    float64
 3   sg              353 non-null    float64
 4   al              354 non-null    float64
 5   su              351 non-null    float64
 6   rbc             248 non-null    object 
 7   pc              335 non-null    object 
 8   pcc             396 non-null    object 
 9   ba              396 non-null    object 
 10  bgr             356 non-null    float64
 11  bu              381 non-null    float64
 12  sc              383 non-null    float64
 13  sod             313 non-null    float64
 14  pot             312 non-null    float64
 15  hemo            348 non-null    float64
 16  pcv             330 non-null    object 
 17  wc              295 non-null    object 
 18  rc              270 non-null    object 
 19  htn             398 non-null    object 
 20  dm              398 non-null    object 
 21  cad             398 non-null    object 
 22  appet           399 non-null    object 
 23  pe              399 non-null    object 
 24  ane             399 non-null    object 
 25  classification  400 non-null    object 
dtypes: float64(11), int64(1), object(14)
memory usage: 81.4+ KB
In [9]:
df.describe()
Out[9]:
id age bp sg al su bgr bu sc sod pot hemo
count 400.000000 391.000000 388.000000 353.000000 354.000000 351.000000 356.000000 381.000000 383.000000 313.000000 312.000000 348.000000
mean 199.500000 51.483376 76.469072 1.017408 1.016949 0.450142 148.036517 57.425722 3.072454 137.528754 4.627244 12.526437
std 115.614301 17.169714 13.683637 0.005717 1.352679 1.099191 79.281714 50.503006 5.741126 10.408752 3.193904 2.912587
min 0.000000 2.000000 50.000000 1.005000 0.000000 0.000000 22.000000 1.500000 0.400000 4.500000 2.500000 3.100000
25% 99.750000 42.000000 70.000000 1.010000 0.000000 0.000000 99.000000 27.000000 0.900000 135.000000 3.800000 10.300000
50% 199.500000 55.000000 80.000000 1.020000 0.000000 0.000000 121.000000 42.000000 1.300000 138.000000 4.400000 12.650000
75% 299.250000 64.500000 80.000000 1.020000 2.000000 0.000000 163.000000 66.000000 2.800000 142.000000 4.900000 15.000000
max 399.000000 90.000000 180.000000 1.025000 5.000000 5.000000 490.000000 391.000000 76.000000 163.000000 47.000000 17.800000
In [10]:
df.isnull().sum()
Out[10]:
id                  0
age                 9
bp                 12
sg                 47
al                 46
su                 49
rbc               152
pc                 65
pcc                 4
ba                  4
bgr                44
bu                 19
sc                 17
sod                87
pot                88
hemo               52
pcv                70
wc                105
rc                130
htn                 2
dm                  2
cad                 2
appet               1
pe                  1
ane                 1
classification      0
dtype: int64
In [11]:
df.columns
Out[11]:
Index(['id', 'age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr',
       'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'classification'],
      dtype='object')
In [12]:
df['id']
Out[12]:
0        0
1        1
2        2
3        3
4        4
      ... 
395    395
396    396
397    397
398    398
399    399
Name: id, Length: 400, dtype: int64
In [13]:
df.drop('id', axis = 1, inplace = True)
In [14]:
df.columns
Out[14]:
Index(['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'classification'],
      dtype='object')
In [15]:
df.columns = ['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar', 'red_blood_cells', 'pus_cell',
              'pus_cell_clumps', 'bacteria', 'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium',
              'potassium', 'haemoglobin', 'packed_cell_volume', 'white_blood_cell_count', 'red_blood_cell_count',
              'hypertension', 'diabetes_mellitus', 'coronary_artery_disease', 'appetite', 'peda_edema',
              'aanemia', 'class']
In [16]:
df.columns
Out[16]:
Index(['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar',
       'red_blood_cells', 'pus_cell', 'pus_cell_clumps', 'bacteria',
       'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium',
       'potassium', 'haemoglobin', 'packed_cell_volume',
       'white_blood_cell_count', 'red_blood_cell_count', 'hypertension',
       'diabetes_mellitus', 'coronary_artery_disease', 'appetite',
       'peda_edema', 'aanemia', 'class'],
      dtype='object')

Converting packed_cell_volume column from object --> int/float¶

In [17]:
df['packed_cell_volume'] # --> Initially object but must be integer
Out[17]:
0      44
1      38
2      31
3      32
4      35
       ..
395    47
396    54
397    49
398    51
399    53
Name: packed_cell_volume, Length: 400, dtype: object
In [18]:
df['packed_cell_volume'].unique() # --> Due to the presence of '\t?' it turns out to be string/object
Out[18]:
array(['44', '38', '31', '32', '35', '39', '36', '33', '29', '28', nan,
       '16', '24', '37', '30', '34', '40', '45', '27', '48', '\t?', '52',
       '14', '22', '18', '42', '17', '46', '23', '19', '25', '41', '26',
       '15', '21', '43', '20', '\t43', '47', '9', '49', '50', '53', '51',
       '54'], dtype=object)
In [19]:
df['packed_cell_volume'] =  pd.to_numeric(df['packed_cell_volume'], errors = 'coerce')
# coerce means supress/ignore the error
In [20]:
df['packed_cell_volume'].dtype
Out[20]:
dtype('float64')
In [21]:
df['packed_cell_volume'].unique() # All string characters are converted into nan(numeric) values
Out[21]:
array([44., 38., 31., 32., 35., 39., 36., 33., 29., 28., nan, 16., 24.,
       37., 30., 34., 40., 45., 27., 48., 52., 14., 22., 18., 42., 17.,
       46., 23., 19., 25., 41., 26., 15., 21., 43., 20., 47.,  9., 49.,
       50., 53., 51., 54.])
In [ ]:
 
In [22]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      391 non-null    float64
 1   blood_pressure           388 non-null    float64
 2   specific_gravity         353 non-null    float64
 3   albumin                  354 non-null    float64
 4   sugar                    351 non-null    float64
 5   red_blood_cells          248 non-null    object 
 6   pus_cell                 335 non-null    object 
 7   pus_cell_clumps          396 non-null    object 
 8   bacteria                 396 non-null    object 
 9   blood_glucose_random     356 non-null    float64
 10  blood_urea               381 non-null    float64
 11  serum_creatinine         383 non-null    float64
 12  sodium                   313 non-null    float64
 13  potassium                312 non-null    float64
 14  haemoglobin              348 non-null    float64
 15  packed_cell_volume       329 non-null    float64
 16  white_blood_cell_count   295 non-null    object 
 17  red_blood_cell_count     270 non-null    object 
 18  hypertension             398 non-null    object 
 19  diabetes_mellitus        398 non-null    object 
 20  coronary_artery_disease  398 non-null    object 
 21  appetite                 399 non-null    object 
 22  peda_edema               399 non-null    object 
 23  aanemia                  399 non-null    object 
 24  class                    400 non-null    object 
dtypes: float64(12), object(13)
memory usage: 78.3+ KB
In [23]:
df['white_blood_cell_count'].unique()
Out[23]:
array(['7800', '6000', '7500', '6700', '7300', nan, '6900', '9600',
       '12100', '4500', '12200', '11000', '3800', '11400', '5300', '9200',
       '6200', '8300', '8400', '10300', '9800', '9100', '7900', '6400',
       '8600', '18900', '21600', '4300', '8500', '11300', '7200', '7700',
       '14600', '6300', '\t6200', '7100', '11800', '9400', '5500', '5800',
       '13200', '12500', '5600', '7000', '11900', '10400', '10700',
       '12700', '6800', '6500', '13600', '10200', '9000', '14900', '8200',
       '15200', '5000', '16300', '12400', '\t8400', '10500', '4200',
       '4700', '10900', '8100', '9500', '2200', '12800', '11200', '19100',
       '\t?', '12300', '16700', '2600', '26400', '8800', '7400', '4900',
       '8000', '12000', '15700', '4100', '5700', '11500', '5400', '10800',
       '9900', '5200', '5900', '9300', '9700', '5100', '6600'],
      dtype=object)
In [24]:
df['white_blood_cell_count'] = pd.to_numeric(df['white_blood_cell_count'], errors = 'coerce')
df['white_blood_cell_count'].dtype
Out[24]:
dtype('float64')
In [25]:
df['red_blood_cell_count'] = pd.to_numeric(df['red_blood_cell_count'], errors = 'coerce')
df['red_blood_cell_count'].dtype
Out[25]:
dtype('float64')
In [26]:
Categorical = [col for col in df.columns if df[col].dtype == 'object']
Categorical
Out[26]:
['red_blood_cells',
 'pus_cell',
 'pus_cell_clumps',
 'bacteria',
 'hypertension',
 'diabetes_mellitus',
 'coronary_artery_disease',
 'appetite',
 'peda_edema',
 'aanemia',
 'class']
In [27]:
Numerical = [col for col in df.columns if df[col].dtype != 'object']
Numerical
Out[27]:
['age',
 'blood_pressure',
 'specific_gravity',
 'albumin',
 'sugar',
 'blood_glucose_random',
 'blood_urea',
 'serum_creatinine',
 'sodium',
 'potassium',
 'haemoglobin',
 'packed_cell_volume',
 'white_blood_cell_count',
 'red_blood_cell_count']
In [28]:
for col in Categorical:
    print(f' {col} : \n {df[col].unique()}')
 red_blood_cells : 
 [nan 'normal' 'abnormal']
 pus_cell : 
 ['normal' 'abnormal' nan]
 pus_cell_clumps : 
 ['notpresent' 'present' nan]
 bacteria : 
 ['notpresent' 'present' nan]
 hypertension : 
 ['yes' 'no' nan]
 diabetes_mellitus : 
 ['yes' 'no' ' yes' '\tno' '\tyes' nan]
 coronary_artery_disease : 
 ['no' 'yes' '\tno' nan]
 appetite : 
 ['good' 'poor' nan]
 peda_edema : 
 ['no' 'yes' nan]
 aanemia : 
 ['no' 'yes' nan]
 class : 
 ['ckd' 'ckd\t' 'notckd']
In [29]:
df['diabetes_mellitus'].replace(to_replace = {' yes' : 'yes', '\tno' : 'no', '\tyes' : 'yes'}, inplace = True)
df['diabetes_mellitus'].unique()
Out[29]:
array(['yes', 'no', nan], dtype=object)
In [30]:
df['coronary_artery_disease'].replace(to_replace = {'\tno':'no'}, inplace = True)
df['coronary_artery_disease'].unique()
Out[30]:
array(['no', 'yes', nan], dtype=object)
In [31]:
df['class'].replace(to_replace = {'ckd\t':'ckd'}, inplace = True)
df['class'].unique()
Out[31]:
array(['ckd', 'notckd'], dtype=object)
In [32]:
df['class'] = df['class'].map({'ckd': 1, 'notckd': 0})

Univariet analysis¶

In [33]:
plt.figure(figsize = (10,6))
sns.histplot(df['age'].dropna(), kde = True, bins = 20)
plt.title("Distribution of age")
plt.xlabel('Age')

plt.show()
No description has been provided for this image

Insights:¶

The histogram and the corresponding KDE curve show that the distribution of age is left-skewed (or negatively skewed). This indicates that the majority of individuals in this dataset are concentrated in the older age brackets, with a smaller number of younger individuals.

In [34]:
df.columns
Out[34]:
Index(['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar',
       'red_blood_cells', 'pus_cell', 'pus_cell_clumps', 'bacteria',
       'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium',
       'potassium', 'haemoglobin', 'packed_cell_volume',
       'white_blood_cell_count', 'red_blood_cell_count', 'hypertension',
       'diabetes_mellitus', 'coronary_artery_disease', 'appetite',
       'peda_edema', 'aanemia', 'class'],
      dtype='object')
In [35]:
sns.countplot(x = 'hypertension', data = df, palette = 'Set2')
Out[35]:
<Axes: xlabel='hypertension', ylabel='count'>
No description has been provided for this image

Insights:¶

The chart shows that the number of individuals without hypertension ('no') is significantly higher than the number of individuals with hypertension ('yes').

Count Comparison¶

'No' Hypertension: The count for individuals without hypertension is approximately 250.

'Yes' Hypertension: The count for individuals with hypertension is approximately 150.

This suggests that hypertension is not a universal condition in this dataset, and the majority of the population does not have it.

In [36]:
plt.figure(figsize = (10,8))
sns.boxplot(x = 'class', y = 'blood_urea', data = df, palette = 'viridis')
plt.title('Boxplot')
Out[36]:
Text(0.5, 1.0, 'Boxplot')
No description has been provided for this image

Insights:¶

There is a significant difference in blood urea levels between the two classes. The distribution for class 1 is much higher and more spread out than for class 0. This suggests a strong positive correlation between higher blood urea levels and the condition represented by class 1.

class 1 has numerous, significant outliers with extremely high blood urea levels, some reaching nearly 400 this suggest that few indivisuals of this class have extremely elevated blood urea whereas class 0 is much more tighter distribution.

In [37]:
sns.violinplot(x = 'class', y = 'serum_creatinine', data = df, palette = 'muted')
Out[37]:
<Axes: xlabel='class', ylabel='serum_creatinine'>
No description has been provided for this image
In [38]:
sns.countplot(x = 'aanemia', data = df,palette = 'pastel')
Out[38]:
<Axes: xlabel='aanemia', ylabel='count'>
No description has been provided for this image

Insights:¶

The chart shows that number of indivisuals with aanemia ("no") is much more higher than the number of indivisuals with aanemia("yes").

Count comparision¶

'No' aanemia : The count for indivisuals without aanemia is approximately 350.

'Yes' aanemia : The count for indivisuals with aanemia is slightly greator than 50.

This suggests that aanemia is not the universal condition and majority of population doesnt have it.

In [39]:
df['appetite'].unique()
Out[39]:
array(['good', 'poor', nan], dtype=object)
In [40]:
x = df['appetite'].value_counts()
x
Out[40]:
appetite
good    317
poor     82
Name: count, dtype: int64
In [41]:
plt.figure(figsize = (8,8))
plt.pie(x, labels = x.index, autopct = '%.1f%%', colors = ['lightpink','lightcoral'],explode = (0,0.1), shadow = True, startangle=90)
plt.title('Pie chart for appetite')
plt.show()
No description has been provided for this image
In [42]:
x.plot.pie(autopct = '%1.1f%%', colors = ['lightpink','lightcoral'],explode = (0,0.1), shadow = True, startangle=90)
Out[42]:
<Axes: ylabel='count'>
No description has been provided for this image

Insights:¶

The portion of population having appetite 'good' (79.4%) is much higher as compared to the portion of population having appetite 'ppor' (20.6%)

In [43]:
df['pus_cell_clumps']
Out[43]:
0      notpresent
1      notpresent
2      notpresent
3         present
4      notpresent
          ...    
395    notpresent
396    notpresent
397    notpresent
398    notpresent
399    notpresent
Name: pus_cell_clumps, Length: 400, dtype: object
In [44]:
sns.countplot(x = df['pus_cell_clumps'], palette = 'Set1')
Out[44]:
<Axes: xlabel='pus_cell_clumps', ylabel='count'>
No description has been provided for this image

Insights¶

The number of indivisuals with pus_cell_clumps ('not present') is much more higher than number of indivisuals with pus_cell_clumps ('present')

Count comparision¶

'not present pus_cell_clumps: The count of indivisuals without pus_cell_clumps is approximately 350.

'present' pus_cell_clumps: The count of indivisuals with pus_cell_clumps is nearly 50.

This indicates that pus_cell_clumps is not universal condition and majority of population doesnt have it.

In [45]:
df['white_blood_cell_count']
Out[45]:
0      7800.0
1      6000.0
2      7500.0
3      6700.0
4      7300.0
        ...  
395    6700.0
396    7800.0
397    6600.0
398    7200.0
399    6800.0
Name: white_blood_cell_count, Length: 400, dtype: float64
In [46]:
sns.histplot(df['white_blood_cell_count'].dropna(), bins = 20, kde = True, color = 'darkred')
Out[46]:
<Axes: xlabel='white_blood_cell_count', ylabel='Count'>
No description has been provided for this image

Insights:¶

The plot shows that the distribution of white blood cell count is right-skewed (or positively skewed). This means that the majority of the data points are concentrated on the lower end of the count, with a long tail extending to the right, representing a smaller number of individuals with very high counts.

In [47]:
# Donut plot - donout chart or ring chart
In [48]:
df['diabetes_mellitus'].value_counts().plot.pie(autopct = '%1.1f%%', wedgeprops = dict(width = 0.5))
Out[48]:
<Axes: ylabel='count'>
No description has been provided for this image

Insights:¶

The majority of the population, at 65.6%, does not have diabetes mellitus. The remaining 34.4% of the population does.

In [49]:
sns.countplot(x = 'coronary_artery_disease', data = df, palette = 'Set2')
Out[49]:
<Axes: xlabel='coronary_artery_disease', ylabel='count'>
No description has been provided for this image

Insights¶

The number of indivisuals with coronary_artery_disease ('no') is much more higher than number of indivisuals with coronary_artery_disease ('yes')

Count comparision 'no' coronary_artery_disease: The count of indivisuals without coronary_artery_disease is approximately 360.

'yes' coronary_artery_disease: The count of indivisuals with coronary_artery_disease is nearly 30.

This indicates that coronary_artery_disease is not universal condition and majority of population doesnt have it.

In [50]:
df.columns
Out[50]:
Index(['age', 'blood_pressure', 'specific_gravity', 'albumin', 'sugar',
       'red_blood_cells', 'pus_cell', 'pus_cell_clumps', 'bacteria',
       'blood_glucose_random', 'blood_urea', 'serum_creatinine', 'sodium',
       'potassium', 'haemoglobin', 'packed_cell_volume',
       'white_blood_cell_count', 'red_blood_cell_count', 'hypertension',
       'diabetes_mellitus', 'coronary_artery_disease', 'appetite',
       'peda_edema', 'aanemia', 'class'],
      dtype='object')
In [51]:
sns.countplot(x = 'peda_edema', data = df, palette = 'pastel')
Out[51]:
<Axes: xlabel='peda_edema', ylabel='count'>
No description has been provided for this image

Insights¶

The number of indivisuals with peda_edema ('no') is much more higher than number of indivisuals with peda_edema ('yes')

Count comparision 'no' peda_edema: The count of indivisuals without peda_edema is approximately 325.

'yes' peda_edema: The count of indivisuals with peda_edema is nearly 75.

This indicates that peda_edema is not universal condition and majority of population doesnt have it.

In [52]:
sns.countplot(x = 'bacteria', data = df, palette = 'muted')
Out[52]:
<Axes: xlabel='bacteria', ylabel='count'>
No description has been provided for this image

Bivariet Analysis¶

In [53]:
sns.scatterplot(x = 'age', y ='blood_pressure', data = df)
Out[53]:
<Axes: xlabel='age', ylabel='blood_pressure'>
No description has been provided for this image

Insights:¶

The plot shows a positive correlation between age and blood pressure. As age increases, blood pressure also tends to increase. The data points form a triangular or "cone" shape, with the spread of blood pressure values becoming wider as age increases.

In [54]:
sns.scatterplot(x = 'age', y ='blood_pressure',hue = 'class', data = df, palette = 'coolwarm')
Out[54]:
<Axes: xlabel='age', ylabel='blood_pressure'>
No description has been provided for this image
In [55]:
sns.boxplot(x = 'diabetes_mellitus' ,y = 'albumin' ,data = df, palette='muted')
Out[55]:
<Axes: xlabel='diabetes_mellitus', ylabel='albumin'>
No description has been provided for this image
In [56]:
sns.violinplot(x = 'diabetes_mellitus' ,y = 'albumin' ,data = df, palette='muted', inner = 'quartile')
Out[56]:
<Axes: xlabel='diabetes_mellitus', ylabel='albumin'>
No description has been provided for this image
In [57]:
# Stacked bar chart
In [58]:
pd.crosstab(df['diabetes_mellitus'], df['hypertension'])
Out[58]:
hypertension no yes
diabetes_mellitus
no 220 41
yes 31 106
In [59]:
diabetes_hpertension = pd.crosstab(df['diabetes_mellitus'], df['hypertension'])
diabetes_hpertension.plot(kind = 'bar', stacked = True)
Out[59]:
<Axes: xlabel='diabetes_mellitus'>
No description has been provided for this image

Multi-variate Analysis¶

In [60]:
cols = ['age', 'blood_pressure', 'blood_glucose_random', 'serum_creatinine', 'class']
df[cols]
Out[60]:
age blood_pressure blood_glucose_random serum_creatinine class
0 48.0 80.0 121.0 1.2 1
1 7.0 50.0 NaN 0.8 1
2 62.0 80.0 423.0 1.8 1
3 48.0 70.0 117.0 3.8 1
4 51.0 80.0 106.0 1.4 1
... ... ... ... ... ...
395 55.0 80.0 140.0 0.5 0
396 42.0 70.0 75.0 1.2 0
397 12.0 80.0 100.0 0.6 0
398 17.0 60.0 114.0 1.0 0
399 58.0 80.0 131.0 1.1 0

400 rows × 5 columns

In [61]:
g = sns.PairGrid(df[cols], hue = 'class', palette = 'coolwarm')
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot, cmap = 'Blues_d')
g.map_diag(sns.histplot)
g.add_legend()
plt.title('PairGrid for selected columns')
plt.show()
No description has been provided for this image
In [62]:
df.corr(numeric_only=True)
Out[62]:
age blood_pressure specific_gravity albumin sugar blood_glucose_random blood_urea serum_creatinine sodium potassium haemoglobin packed_cell_volume white_blood_cell_count red_blood_cell_count class
age 1.000000 0.159480 -0.191096 0.122091 0.220866 0.244992 0.196985 0.132531 -0.100046 0.058377 -0.192928 -0.242119 0.118339 -0.268896 0.227268
blood_pressure 0.159480 1.000000 -0.218836 0.160689 0.222576 0.160193 0.188517 0.146222 -0.116422 0.075151 -0.306540 -0.326319 0.029753 -0.261936 0.294077
specific_gravity -0.191096 -0.218836 1.000000 -0.469760 -0.296234 -0.374710 -0.314295 -0.361473 0.412190 -0.072787 0.602582 0.603560 -0.236215 0.579476 -0.732163
albumin 0.122091 0.160689 -0.469760 1.000000 0.269305 0.379464 0.453528 0.399198 -0.459896 0.129038 -0.634632 -0.611891 0.231989 -0.566437 0.627090
sugar 0.220866 0.222576 -0.296234 0.269305 1.000000 0.717827 0.168583 0.223244 -0.131776 0.219450 -0.224775 -0.239189 0.184893 -0.237448 0.344070
blood_glucose_random 0.244992 0.160193 -0.374710 0.379464 0.717827 1.000000 0.143322 0.114875 -0.267848 0.066966 -0.306189 -0.301385 0.150015 -0.281541 0.419672
blood_urea 0.196985 0.188517 -0.314295 0.453528 0.168583 0.143322 1.000000 0.586368 -0.323054 0.357049 -0.610360 -0.607621 0.050462 -0.579087 0.380605
serum_creatinine 0.132531 0.146222 -0.361473 0.399198 0.223244 0.114875 0.586368 1.000000 -0.690158 0.326107 -0.401670 -0.404193 -0.006390 -0.400852 0.299969
sodium -0.100046 -0.116422 0.412190 -0.459896 -0.131776 -0.267848 -0.323054 -0.690158 1.000000 0.097887 0.365183 0.376914 0.007277 0.344873 -0.375674
potassium 0.058377 0.075151 -0.072787 0.129038 0.219450 0.066966 0.357049 0.326107 0.097887 1.000000 -0.133746 -0.163182 -0.105576 -0.158309 0.084541
haemoglobin -0.192928 -0.306540 0.602582 -0.634632 -0.224775 -0.306189 -0.610360 -0.401670 0.365183 -0.133746 1.000000 0.895382 -0.169413 0.798880 -0.768919
packed_cell_volume -0.242119 -0.326319 0.603560 -0.611891 -0.239189 -0.301385 -0.607621 -0.404193 0.376914 -0.163182 0.895382 1.000000 -0.197022 0.791625 -0.741427
white_blood_cell_count 0.118339 0.029753 -0.236215 0.231989 0.184893 0.150015 0.050462 -0.006390 0.007277 -0.105576 -0.169413 -0.197022 1.000000 -0.158163 0.231919
red_blood_cell_count -0.268896 -0.261936 0.579476 -0.566437 -0.237448 -0.281541 -0.579087 -0.400852 0.344873 -0.158309 0.798880 0.791625 -0.158163 1.000000 -0.699089
class 0.227268 0.294077 -0.732163 0.627090 0.344070 0.419672 0.380605 0.299969 -0.375674 0.084541 -0.768919 -0.741427 0.231919 -0.699089 1.000000
In [63]:
plt.figure(figsize= (10,8))
sns.heatmap(df.corr(numeric_only=True), cmap = 'coolwarm', annot= True)
Out[63]:
<Axes: >
No description has been provided for this image
In [64]:
sns.swarmplot(x = 'diabetes_mellitus', y = 'age', hue = 'hypertension', data = df, palette = 'pastel', size=8)
Out[64]:
<Axes: xlabel='diabetes_mellitus', ylabel='age'>
No description has been provided for this image
In [65]:
fig = px.scatter(df, x = 'age', y = 'blood_pressure', color = 'class', hover_data = ['serum_creatinine', 'haemoglobin'], title = 'Interactive scatter plot')
fig.show()
In [66]:
fig = px.scatter_3d(df, x = 'age', y = 'blood_pressure', z = 'serum_creatinine', color = 'class', title = '3D SCATTER PLOT')
fig.show()
In [67]:
fig = px.scatter_3d(df, x = 'age', y = 'blood_pressure', z = 'serum_creatinine', color = 'haemoglobin', title = '3D SCATTER PLOT')
fig.show()
In [68]:
import plotly.graph_objects as go

corr = df.corr(numeric_only=True)
fig = go.Figure(data = go.Heatmap(z = corr.values, x = corr.columns, y = corr.index))
fig.show()
In [69]:
df.isnull().sum()
Out[69]:
age                          9
blood_pressure              12
specific_gravity            47
albumin                     46
sugar                       49
red_blood_cells            152
pus_cell                    65
pus_cell_clumps              4
bacteria                     4
blood_glucose_random        44
blood_urea                  19
serum_creatinine            17
sodium                      87
potassium                   88
haemoglobin                 52
packed_cell_volume          71
white_blood_cell_count     106
red_blood_cell_count       131
hypertension                 2
diabetes_mellitus            2
coronary_artery_disease      2
appetite                     1
peda_edema                   1
aanemia                      1
class                        0
dtype: int64
In [70]:
Categorical
Out[70]:
['red_blood_cells',
 'pus_cell',
 'pus_cell_clumps',
 'bacteria',
 'hypertension',
 'diabetes_mellitus',
 'coronary_artery_disease',
 'appetite',
 'peda_edema',
 'aanemia',
 'class']
In [71]:
Numerical
Out[71]:
['age',
 'blood_pressure',
 'specific_gravity',
 'albumin',
 'sugar',
 'blood_glucose_random',
 'blood_urea',
 'serum_creatinine',
 'sodium',
 'potassium',
 'haemoglobin',
 'packed_cell_volume',
 'white_blood_cell_count',
 'red_blood_cell_count']
In [72]:
median = df[Numerical].median()
median
Out[72]:
age                         55.00
blood_pressure              80.00
specific_gravity             1.02
albumin                      0.00
sugar                        0.00
blood_glucose_random       121.00
blood_urea                  42.00
serum_creatinine             1.30
sodium                     138.00
potassium                    4.40
haemoglobin                 12.65
packed_cell_volume          40.00
white_blood_cell_count    8000.00
red_blood_cell_count         4.80
dtype: float64
In [73]:
df[Numerical] = df[Numerical].fillna(median)
In [74]:
df[Numerical].isnull().sum()
Out[74]:
age                       0
blood_pressure            0
specific_gravity          0
albumin                   0
sugar                     0
blood_glucose_random      0
blood_urea                0
serum_creatinine          0
sodium                    0
potassium                 0
haemoglobin               0
packed_cell_volume        0
white_blood_cell_count    0
red_blood_cell_count      0
dtype: int64
In [75]:
mode = df[Categorical].mode().iloc(0)
mode
Out[75]:
<pandas.core.indexing._iLocIndexer at 0x2b655404aa0>
In [76]:
df[Categorical] = df[Categorical].fillna(mode)
In [77]:
df[Categorical].isna().sum()
Out[77]:
red_blood_cells            0
pus_cell                   0
pus_cell_clumps            0
bacteria                   0
hypertension               0
diabetes_mellitus          0
coronary_artery_disease    0
appetite                   0
peda_edema                 0
aanemia                    0
class                      0
dtype: int64
In [78]:
df.dtypes
Out[78]:
age                        float64
blood_pressure             float64
specific_gravity           float64
albumin                    float64
sugar                      float64
red_blood_cells             object
pus_cell                    object
pus_cell_clumps             object
bacteria                    object
blood_glucose_random       float64
blood_urea                 float64
serum_creatinine           float64
sodium                     float64
potassium                  float64
haemoglobin                float64
packed_cell_volume         float64
white_blood_cell_count     float64
red_blood_cell_count       float64
hypertension                object
diabetes_mellitus           object
coronary_artery_disease     object
appetite                    object
peda_edema                  object
aanemia                     object
class                        int64
dtype: object
In [82]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
for col in Categorical:
    # Convert the column to string type before fitting the encoder
    df[col] = label_encoder.fit_transform(df[col].astype(str))
In [83]:
df.dtypes
Out[83]:
age                        float64
blood_pressure             float64
specific_gravity           float64
albumin                    float64
sugar                      float64
red_blood_cells              int32
pus_cell                     int32
pus_cell_clumps              int32
bacteria                     int32
blood_glucose_random       float64
blood_urea                 float64
serum_creatinine           float64
sodium                     float64
potassium                  float64
haemoglobin                float64
packed_cell_volume         float64
white_blood_cell_count     float64
red_blood_cell_count       float64
hypertension                 int32
diabetes_mellitus            int32
coronary_artery_disease      int32
appetite                     int32
peda_edema                   int32
aanemia                      int32
class                        int32
dtype: object
In [84]:
df[Categorical]
Out[84]:
red_blood_cells pus_cell pus_cell_clumps bacteria hypertension diabetes_mellitus coronary_artery_disease appetite peda_edema aanemia class
0 0 2 1 1 2 2 1 1 1 1 1
1 0 2 1 1 1 1 1 1 1 1 1
2 2 2 1 1 1 2 1 2 1 2 1
3 2 1 2 1 2 1 1 2 2 2 1
4 2 2 1 1 1 1 1 1 1 1 1
... ... ... ... ... ... ... ... ... ... ... ...
395 2 2 1 1 1 1 1 1 1 1 0
396 2 2 1 1 1 1 1 1 1 1 0
397 2 2 1 1 1 1 1 1 1 1 0
398 2 2 1 1 1 1 1 1 1 1 0
399 2 2 1 1 1 1 1 1 1 1 0

400 rows × 11 columns

In [85]:
df.head()
Out[85]:
age blood_pressure specific_gravity albumin sugar red_blood_cells pus_cell pus_cell_clumps bacteria blood_glucose_random ... packed_cell_volume white_blood_cell_count red_blood_cell_count hypertension diabetes_mellitus coronary_artery_disease appetite peda_edema aanemia class
0 48.0 80.0 1.020 1.0 0.0 0 2 1 1 121.0 ... 44.0 7800.0 5.2 2 2 1 1 1 1 1
1 7.0 50.0 1.020 4.0 0.0 0 2 1 1 121.0 ... 38.0 6000.0 4.8 1 1 1 1 1 1 1
2 62.0 80.0 1.010 2.0 3.0 2 2 1 1 423.0 ... 31.0 7500.0 4.8 1 2 1 2 1 2 1
3 48.0 70.0 1.005 4.0 0.0 2 1 2 1 117.0 ... 32.0 6700.0 3.9 2 1 1 2 2 2 1
4 51.0 80.0 1.010 2.0 0.0 2 2 1 1 106.0 ... 35.0 7300.0 4.6 1 1 1 1 1 1 1

5 rows × 25 columns

In [ ]: